Abstract


We discuss a metric structure on the set of partitions of a finite 
set induced by the Gini index and two applications of this metric: the 
identification of determining sets for index functions using techniques 
that originate in machine learning, and a data compression algorithm. 
Keywords: Gini index, Vapnik-Chervonenkis dimension, index functions


1 Introduction


The Gini index was developed as a measure of wealth inequality by the
Italian statistician Corrado Gini [1,2] and has become increasingly important
in machine learning. The Gini index is related to, but distinct from, Shannon
entropy (both belong to the same family of measures of diversity of
probability distributions) and can be given an algebraic treatment that is
useful in our context.

We discuss two rather distinct problems where the Gini index and a 
metric induced by this index on the set of partitions of a finite set prove to 
be useful, namely, the identification of determining sets for index functions, 
and a compression algorithm. 

Index functions were introduced and studied by T. Sasao in a series of
papers [3-10,15,16] and have multiple applications, including terminal access
controllers, IP address table lookup, packet filtering, memory patch and virus
scan circuits, and fault maps for memory. In general, the number of variables
is large and these functions do not depend effectively on all their variables.
Therefore, identifying minimal sets of variables on which such
functions depend (known as determining sets) may lead to simpler
circuits that implement these functions. We investigated the identification of
determining sets for index functions in a previous contribution and proposed
an Apriori-like algorithm [17].

Let $S$ be a finite set and let $\mathcal{P}(S)$ be the collection of its subsets. A
partition of $S$ is a collection $\pi = \{B_1,\ldots,B_m\}$ of pairwise disjoint, non-empty
subsets of $S$ such that $\bigcup_{i=1}^{m} B_i = S$. The sets $B_1,\ldots,B_m$ are the blocks
of $\pi$. The set of partitions of $S$ is denoted by $\mathrm{PART}(S)$.

A partial order relation is introduced on $\mathrm{PART}(S)$. For $\pi,\sigma \in \mathrm{PART}(S)$
we write $\pi \leq \sigma$ if each block of $\pi$ is included in a block of $\sigma$. It is easy to
see that this is equivalent to asking that each block of $\sigma$ is a union of blocks
of $\pi$. The largest partition in $\mathrm{PART}(S)$ is the one-block partition $\omega_S = \{S\}$;
the smallest partition is $\alpha_S = \{\{x\} \mid x \in S\}$, which consists of singletons.

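If $\pi,\sigma \in \mathrm{PART}(S)$, the partition $\pi \wedge \sigma$ is the partition of $S$ that consists
of the sets of the form $B_i \cap C_j$, where $B_i \in \pi$, $C_j \in \sigma$, and $B_i \cap C_j \neq \emptyset$. Clearly,
we have $\pi \wedge \sigma \leq \pi$ and $\pi \wedge \sigma \leq \sigma$. Also, $\rho \leq \pi$ and $\rho \leq \sigma$ if and only if
$\rho \leq \pi \wedge \sigma$.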
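The order and the meet of partitions can be illustrated with a minimal Python sketch. The representation of a partition as a set of `frozenset` blocks and the helper names `meet` and `refines` are assumptions made for this illustration only.

```python
from itertools import product

def meet(pi, sigma):
    """Blocks of the meet: the non-empty intersections of a block of pi with a block of sigma."""
    return {b & c for b, c in product(pi, sigma) if b & c}

def refines(pi, sigma):
    """True when pi <= sigma, i.e. every block of pi is included in a block of sigma."""
    return all(any(b <= c for c in sigma) for b in pi)

# Two partitions of S = {1, 2, 3, 4} (illustrative data).
pi = {frozenset({1, 2}), frozenset({3, 4})}
sigma = {frozenset({1, 2, 3}), frozenset({4})}
m = meet(pi, sigma)   # blocks {1, 2}, {3}, {4}
```

Here `m` refines both `pi` and `sigma`, while `pi` itself does not refine `sigma` because the block {3, 4} is not included in any block of `sigma`.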

For $U,V \in \mathcal{P}(S)$ denote by $U \oplus V$ the symmetric difference of the sets $U$
and $V$. We have

$$|U \oplus V| = |U| + |V| - 2|U \cap V|.$$

The mapping $d : \mathcal{P}(S)^2 \longrightarrow \mathbb{R}_{\geq 0}$ defined as $d(U,V) = |U \oplus V|$ is a metric on
$\mathcal{P}(S)$. In other words, we have $d(U,V) = d(V,U)$, $d(U,V) = 0$ if and only if
$U = V$, and $d(U,V) \leq d(U,W) + d(W,V)$ for every $U,V,W \in \mathcal{P}(S)$.

The paper is structured as follows. In Section 2 we discuss the metric 
space of partitions of finite sets. Then, in Section 3 we establish a link 
between the Vapnik-Chervonenkis dimension of collections of sets and the 
size of determining sets for index functions. An algorithm for data compression
based on the Gini index is presented in Section 4. Finally, we present our 
conclusions in Section 5. 


2 The Metric Space of Set Partitions 


For a partition $\pi \in \mathrm{PART}(S)$ let $P_\pi$ be the equivalence relation defined
by $\pi$, which consists of all pairs $(x,y) \in S \times S$ such that $x$ and $y$ belong to
the same block $B_i$ of $\pi$. In other words, for $\pi = \{B_i \mid i \in I\}$ we have
$P_\pi = \bigcup_{i \in I} (B_i \times B_i)$.

For $\pi,\sigma \in \mathrm{PART}(S)$ it is clear that $\pi = \sigma$ if and only if $P_\pi = P_\sigma$. For a
finite set $S$ we define a metric on $\mathrm{PART}(S)$ as

$$\delta(\pi,\sigma) = \frac{1}{n^2}\,|P_\pi \oplus P_\sigma| = \frac{1}{n^2}\left(|P_\pi| + |P_\sigma| - 2|P_\pi \cap P_\sigma|\right),$$

where "$\oplus$" denotes the symmetric difference of two sets and $n = |S|$.
The Gini index of the partition $\pi = \{B_1,\ldots,B_m\}$ is the number

$$\mathrm{gini}(\pi) = 1 - \frac{|P_\pi|}{n^2} = 1 - \sum_{i=1}^{m} \frac{|B_i|^2}{n^2},$$

that is, the relative number of pairs that do not inhabit the same block of
the partition $\pi$.
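The definition translates directly into code. The following is a minimal sketch that assumes a partition is given as a list of blocks; the function name `gini` and the sample blocks are illustrative. Exact rational arithmetic is used so that values can be compared without rounding.

```python
from fractions import Fraction

def gini(partition):
    """Gini index of a partition: 1 minus the sum of squared block sizes over n^2."""
    n = sum(len(block) for block in partition)
    return 1 - Fraction(sum(len(block) ** 2 for block in partition), n * n)

# A partition of a nine-element set into blocks of sizes 3, 2, 2, 2.
blocks = [[0, 3, 7], [1, 2], [4, 6], [5, 8]]
value = gini(blocks)                            # 1 - 21/81 = 20/27

one_block = gini([list(range(9))])              # the one-block partition has index 0
singletons = gini([[i] for i in range(9)])      # the singleton partition has index 1 - 1/9
```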

The largest value of $\mathrm{gini}(\pi)$ for a partition in $\mathrm{PART}(S)$ that has $m$
blocks is obtained when all blocks have equal sizes and equals $1 - \frac{1}{m}$ (when
$n = |S|$ is a multiple of $m$). The least value is obtained when $\pi$ consists of
$m-1$ blocks of size 1 and one block of size $n-m+1$ and equals

$$1 - \frac{m-1}{n^2} - \frac{(n-m+1)^2}{n^2} = \frac{(m-1)(2n-m)}{n^2}.$$
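The closed form of the least value follows from a short expansion, written out here as a worked check. The numeric instance $n = 9$, $m = 4$ is chosen only for illustration.

```latex
\begin{aligned}
n^2 - (m-1) - (n-m+1)^2
  &= n^2 - (m-1) - \bigl(n - (m-1)\bigr)^2 \\
  &= n^2 - (m-1) - n^2 + 2n(m-1) - (m-1)^2 \\
  &= (m-1)\bigl(2n - 1 - (m-1)\bigr) \\
  &= (m-1)(2n-m).
\end{aligned}
% Instance n = 9, m = 4: blocks of sizes 1, 1, 1, 6 give
% 1 - 3/81 - 36/81 = 42/81, and (m-1)(2n-m)/n^2 = 3 \cdot 14 / 81 = 42/81.
```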


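Let now $\pi = \{B_1,\ldots,B_m\}$ and $\sigma = \{C_1,\ldots,C_p\}$ be two partitions of a
set $S$ and let $\pi \wedge \sigma$ be the partition of $S$ whose blocks are the non-empty
intersections $B_i \cap C_j$ of blocks of $\pi$ and $\sigma$. We have $P_{\pi \wedge \sigma} = P_\pi \cap P_\sigma$. Denote a block
$B_i \cap C_j$ by $D_{ij}$ and let $d_{ij} = |D_{ij}|$. We have $|B_i| = \sum_{j=1}^{p} |D_{ij}|$ and $|C_j| = \sum_{i=1}^{m} |D_{ij}|$.
Therefore,

$$|P_{\pi \wedge \sigma}| = \sum_{i=1}^{m}\sum_{j=1}^{p} |D_{ij}|^2, \qquad |P_\pi| = \sum_{i=1}^{m}\Big(\sum_{j=1}^{p} |D_{ij}|\Big)^2, \qquad |P_\sigma| = \sum_{j=1}^{p}\Big(\sum_{i=1}^{m} |D_{ij}|\Big)^2.$$

This allows us to write

$$\begin{aligned}
\delta(\pi,\sigma) &= \frac{1}{n^2}\, d(P_\pi,P_\sigma) = \frac{1}{n^2}\left(|P_\pi| + |P_\sigma| - 2|P_\pi \cap P_\sigma|\right) \\
&= \frac{1}{n^2}\left(\sum_{i=1}^{m}\Big(\sum_{j=1}^{p} |D_{ij}|\Big)^2 + \sum_{j=1}^{p}\Big(\sum_{i=1}^{m} |D_{ij}|\Big)^2 - 2\sum_{i=1}^{m}\sum_{j=1}^{p} |D_{ij}|^2\right).
\end{aligned}$$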
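In terms of the gini function, $\delta(\pi,\sigma)$ can be written as

$$\begin{aligned}
\delta(\pi,\sigma) &= 1 - \mathrm{gini}(\pi) + 1 - \mathrm{gini}(\sigma) - 2\bigl(1 - \mathrm{gini}(\pi \wedge \sigma)\bigr) \\
&= 2\,\mathrm{gini}(\pi \wedge \sigma) - \mathrm{gini}(\pi) - \mathrm{gini}(\sigma).
\end{aligned}$$

Furthermore, we have

$$\mathrm{gini}(\pi) = \delta(\pi,\omega_S) = 1 - \frac{1}{n} - \delta(\pi,\alpha_S).$$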
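These identities can be checked numerically. The sketch below computes the distance from the symmetric difference of the two equivalence relations and compares it with the expression in terms of the Gini index. The helper names (`pairs`, `meet`, `delta`) and the six-element example are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def pairs(partition):
    """The equivalence relation of a partition as a set of ordered pairs."""
    return {(x, y) for block in partition for x in block for y in block}

def gini(partition, n):
    return 1 - Fraction(len(pairs(partition)), n * n)

def meet(pi, sigma):
    return [b & c for b, c in product(pi, sigma) if b & c]

def delta(pi, sigma, n):
    """Distance computed from the symmetric difference of the two relations."""
    return Fraction(len(pairs(pi) ^ pairs(sigma)), n * n)

S = set(range(6))
n = len(S)
pi = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
sigma = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
omega = [frozenset(S)]                       # one-block partition
alpha = [frozenset({x}) for x in S]          # partition into singletons

via_gini = 2 * gini(meet(pi, sigma), n) - gini(pi, n) - gini(sigma, n)
direct = delta(pi, sigma, n)                 # both equal 5/18 for this example
```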
n 
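Example 2.1 In the case of two-block partitions of a set $T$ with $|T| = n$ the
distance has a very simple form. Suppose that $\pi = \{B_0,B_1\}$, $\sigma = \{C_0,C_1\}$
and $\pi \wedge \sigma = \{D_{00},D_{01},D_{10},D_{11}\}$. Let $D$ be the matrix

$$D = \begin{pmatrix} d_{00} & d_{01} \\ d_{10} & d_{11} \end{pmatrix},$$

where $d_{ij} = |D_{ij}|$ for $i,j = 0,1$. The distance $\delta(\pi,\sigma)$ is

$$\begin{aligned}
\delta(\pi,\sigma) &= \frac{1}{n^2}\Big((d_{00}+d_{01})^2 + (d_{10}+d_{11})^2 + (d_{00}+d_{10})^2 + (d_{01}+d_{11})^2 - 2\bigl(d_{00}^2 + d_{01}^2 + d_{10}^2 + d_{11}^2\bigr)\Big) \\
&= \frac{2}{n^2}\left(d_{00}d_{01} + d_{11}d_{01} + d_{00}d_{10} + d_{11}d_{10}\right) \\
&= \frac{2}{n^2}\,(d_{00}+d_{11})(d_{01}+d_{10}). \qquad (1)
\end{aligned}$$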
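For two binary columns, $d_{00}+d_{11}$ counts the positions where the columns agree and $d_{01}+d_{10}$ the positions where they differ, so Formula (1) can be evaluated in one pass. The sketch below compares it with the general expression for the distance; the two columns and the function names are illustrative assumptions.

```python
from fractions import Fraction

def delta_two_block(col_a, col_b):
    """Formula (1): 2 * (agreements) * (disagreements) / n^2 for two binary columns."""
    n = len(col_a)
    agree = sum(a == b for a, b in zip(col_a, col_b))   # d00 + d11
    differ = n - agree                                  # d01 + d10
    return Fraction(2 * agree * differ, n * n)

def sum_sq(keys):
    """Sum of squared block sizes of the partition that groups equal keys."""
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return sum(c * c for c in counts.values())

def delta_general(col_a, col_b):
    """General form: (|P_pi| + |P_sigma| - 2 |P_pi ∩ P_sigma|) / n^2."""
    n = len(col_a)
    both = list(zip(col_a, col_b))
    return Fraction(sum_sq(col_a) + sum_sq(col_b) - 2 * sum_sq(both), n * n)

col_a = [0, 1, 1, 0, 1, 0, 1, 0, 0]
col_b = [1, 0, 1, 1, 0, 1, 0, 0, 1]
short = delta_two_block(col_a, col_b)    # 2 * 2 * 7 / 81
```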


3 Determining Sets for Index Functions 


Let $X = \{x_1,\ldots,x_m\}$ be a finite set of symbols called attributes. A set
$\mathrm{Dom}(x_i)$, referred to as the domain of $x_i$, is attached to each attribute $x_i$,
and a table having the heading $X$ is defined as a pair $T = (X,R)$, where $R$,
the content of the table, is a relation on $\prod_{i=1}^{m} \mathrm{Dom}(x_i)$. The members of $R$
are the tuples or the rows of the table. The weight of $T$ is the number of
tuples, $w(T) = |R|$. Note that the tables defined as above do not contain
duplicate rows.

We adopt the relational database theory notation, where subsets of
table headings are denoted as strings.

Table 1: Tabular Representation of a Partial Function

x1 | x2 | x3 | x4 | x5 | x6 | x7 | y
 0 |  0 |  1 |  0 |  0 |  1 |  1 | 1
 1 |  0 |  0 |  0 |  1 |  0 |  1 | 2
 1 |  0 |  1 |  1 |  0 |  0 |  1 | 3
 0 |  0 |  1 |  1 |  1 |  1 |  0 | 4
 1 |  1 |  0 |  0 |  1 |  1 |  0 | 5
 0 |  1 |  1 |  0 |  1 |  0 |  1 | 6
 1 |  1 |  0 |  1 |  0 |  1 |  1 | 7
 0 |  0 |  0 |  1 |  1 |  0 |  0 | 8
 0 |  1 |  1 |  1 |  0 |  1 |  1 | 9

If $t = (a_1,\ldots,a_m)$ is a tuple in $T$, the restriction of $t$ that consists of
components that correspond to the attributes $Y = x_{i_1} \cdots x_{i_k}$ is denoted by
$t[Y] = (a_{i_1},\ldots,a_{i_k})$ and is referred to as the projection of $t$ on $Y$.


Let $\mathbf{k}$ be the finite set $\{0,1,\ldots,k-1\}$. The number $k$ is referred to as
the radix of the set $\mathbf{k}$. A $\mathbf{k}$-table is a table $T = (X,R)$ with $\mathrm{Dom}(x_i) = \mathbf{k}$ for
$1 \leq i \leq m$.

Consider a set of $n$ different binary vectors of $m$ bits referred to as
registered vectors. An index generation function or, more briefly, an index
function assigns to every registered vector a unique integer from 1 to $n$.
A circuit implementing the index function produces the value $k$ if its input
matches the $k$-th registered vector, and 0 otherwise. The number $n$ is the
weight of the index generation function. Thus, an index generation function
represents a mapping $f : \{0,1\}^m \longrightarrow \{0,1,\ldots,n\}$.

An index table is a table that describes an index function and is defined
as a pair $T = (x_1 \cdots x_m y, R)$, where $\mathrm{Dom}(x_i) = \mathbf{2}$ and $\mathrm{Dom}(y) = \{1,\ldots,n\}$,
where $n = w(T)$. Thus, an index table is a table whose attributes are binary
with the exception of the index attribute $y$, which is an $n$-ary attribute.


Example 3.1 In Table 1 we show a $(\mathbf{2},\mathbf{9})$-index table that contains nine
tuples in $\mathbf{2}^7 \times \mathbf{9}$. For instance, $t_5 = (0,1,1,0,1,0,1,6)$.


If the index table $T$ has the heading $x_1 \cdots x_m y$, then $T$ defines a collection
$\mathcal{C}_T$ of subsets of the set $X = x_1 \cdots x_m$ by interpreting the rows of $T$ as
characteristic vectors of these subsets.
Example 3.2 For the table $T$ given in Example 3.1 the collection $\mathcal{C}_T$ consists
of the following sets:

$$\begin{aligned}
C_0 &= \{x_3,x_6,x_7\}, & C_1 &= \{x_1,x_5,x_7\}, \\
C_2 &= \{x_1,x_3,x_4,x_7\}, & C_3 &= \{x_3,x_4,x_5,x_6\}, \\
C_4 &= \{x_1,x_2,x_5,x_6\}, & C_5 &= \{x_2,x_3,x_5,x_7\}, \\
C_6 &= \{x_1,x_2,x_4,x_6,x_7\}, & C_7 &= \{x_4,x_5\}, \\
C_8 &= \{x_2,x_3,x_4,x_6,x_7\}. &&
\end{aligned}$$
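The passage from rows to sets can be reproduced with a few lines of Python. In this sketch the rows of Table 1 are stored without the index attribute $y$, and attribute $x_i$ is represented by the integer $i$; the names `ROWS`, `row_to_set`, and `C_T` are assumptions of the illustration.

```python
# Rows of Table 1 (attributes x1..x7 only; the index y is the row position plus 1).
ROWS = [
    (0, 0, 1, 0, 0, 1, 1),
    (1, 0, 0, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0),
    (1, 1, 0, 0, 1, 1, 0),
    (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1),
    (0, 0, 0, 1, 1, 0, 0),
    (0, 1, 1, 1, 0, 1, 1),
]

def row_to_set(row):
    """Read a row as a characteristic vector: keep the indices of the attributes equal to 1."""
    return frozenset(i + 1 for i, bit in enumerate(row) if bit == 1)

C_T = [row_to_set(r) for r in ROWS]   # C_T[0] is {3, 6, 7}, i.e. {x3, x6, x7}
```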


We use next the Vapnik-Chervonenkis dimension of a collection of sets.
This characteristic property of collections of sets is of fundamental importance
for machine learning and data mining [13]. A collection $\mathcal{C}$ of subsets of a set
$X$ shatters a subset $U$ of $X$ if

$$\mathcal{P}(U) = \{C \cap U \mid C \in \mathcal{C}\}.$$

The family of sets shattered by $\mathcal{C}$ is denoted by $\mathrm{SH}(\mathcal{C})$. The size of the largest
set in $\mathrm{SH}(\mathcal{C})$ is the Vapnik-Chervonenkis dimension $\mathrm{VC}(\mathcal{C})$ of the collection
$\mathcal{C}$.

For $k,m \in \mathbb{N}$ and $k \leq m$ let $\phi(m,k) = \sum_{i=0}^{k} \binom{m}{i}$. If $|X| = m$, there are
$\phi(m,k)$ subsets of the set $X$ that contain at most $k$ elements.

The Sauer-Shelah theorem [11,14] stipulates that if $\mathcal{C}$ is a collection
of subsets of $X$ such that $|\mathcal{C}| > \phi(m,k-1) = \sum_{i=0}^{k-1} \binom{m}{i}$, then $X$ contains
a set $U$ with $|U| \geq k$ that is shattered by $\mathcal{C}$. In other words, for such a
collection $\mathrm{VC}(\mathcal{C}) \geq k$.

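Note that in order to shatter a $d$-element set $U$ a collection $\mathcal{C}$ must
contain at least $2^d$ sets. Therefore, $\mathrm{VC}(\mathcal{C}) = d$ implies $2^d \leq |\mathcal{C}|$.

If $\mathrm{VC}(\mathcal{C}) = d$, no set with more than $d$ elements is shattered by $\mathcal{C}$.
Therefore, if $\mathcal{P}_d(X)$ is the family of subsets of $X$ that contain $d$ or fewer
elements, $\mathrm{SH}(\mathcal{C}) \subseteq \mathcal{P}_d(X)$, hence

$$2^d \leq |\mathcal{C}| \leq \sum_{i=0}^{d} \binom{m}{i}. \qquad (2)$$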


Theorem 3.3 If $\mathcal{C}$ is a collection of subsets of a set $X$ with $|X| = m$ and
there exist $k,\ell \in \mathbb{N}$ such that $\phi(m,k-1) < |\mathcal{C}| < 2^\ell$, then $k \leq \mathrm{VC}(\mathcal{C}) < \ell$.


Proof: Suppose that $\mathcal{C}$ is a collection of subsets of a set $X$ with $|X| = m$
such that

$$\phi(m,k-1) < |\mathcal{C}| < 2^\ell, \qquad (3)$$

where $\ell = \lceil \log_2(|\mathcal{C}| + 1) \rceil$. The first inequality implies $\mathrm{VC}(\mathcal{C}) \geq k$ by the
Sauer-Shelah theorem. The second inequality yields $\mathrm{VC}(\mathcal{C}) < \ell$ because $\mathcal{C}$
must contain at least $2^\ell$ sets in order to shatter a set of size $\ell$. Thus,
Inequalities (3) imply $k \leq \mathrm{VC}(\mathcal{C}) < \ell$.
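The bounds of Theorem 3.3 depend only on $m$ and on the number of sets in the collection, so they can be computed without inspecting the sets. The following is a small sketch; the function names `phi` and `vc_bounds` are chosen for this illustration, and the loop picks the largest admissible $k$ and the smallest admissible $\ell$.

```python
from math import comb

def phi(m, k):
    """Number of subsets with at most k elements of an m-element set."""
    return sum(comb(m, i) for i in range(k + 1))

def vc_bounds(m, size):
    """Return (k, l) such that phi(m, k-1) < size < 2**l, hence k <= VC < l."""
    k = 0
    while k < m and size > phi(m, k):
        k += 1
    l = 0
    while size >= 2 ** l:
        l += 1
    return k, l

bounds = vc_bounds(7, 9)   # seven attributes, nine sets
```

For seven attributes and nine sets the call returns `(2, 4)`, which leaves 2 and 3 as the only possible values of the dimension.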


Example 3.4 Let $X = \{x_1,x_2,x_3\}$ and let $T_1,T_2$ be the tables shown
below:


      T1                 T2
x1 | x2 | x3       x1 | x2 | x3
 1 |  0 |  0        1 |  0 |  0
 0 |  1 |  0        0 |  1 |  0
 1 |  1 |  0        0 |  0 |  1
 1 |  0 |  1        1 |  1 |  0


Let $\mathcal{C}_{T_1},\mathcal{C}_{T_2}$ be the collections of sets defined by these tables. We have
$|\mathcal{C}_{T_1}| = |\mathcal{C}_{T_2}| = 4$ and, therefore,

$$\phi(3,1) = 4 = |\mathcal{C}| < \phi(3,2) = 7.$$

We have $\mathrm{VC}(\mathcal{C}_{T_1}) = 1$ because $\mathcal{C}_{T_1}$ shatters all one-element subsets but does
not shatter any larger set, and $\mathrm{VC}(\mathcal{C}_{T_2}) = 2$ because $\mathcal{C}_{T_2}$ shatters the set
$\{x_1,x_2\}$.


Example 3.5 The table $T$ from Example 3.1 contains 9 tuples, so we have
$m = 7$ and $|\mathcal{C}_T| = 9$. Since

$$\phi(7,1) = 8 < 9 < 2^4 = 16,$$

by Theorem 3.3, we may conclude that $\mathrm{VC}(\mathcal{C}_T) \in \{2,3\}$. An inspection of the
table shows that there exists a set of three attributes that is shattered by $\mathcal{C}_T$.
For example, one such set is $\{x_2,x_3,x_4\}$, hence $\mathrm{VC}(\mathcal{C}_T) = 3$.
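The inspection mentioned in the example can be automated by brute force on small tables. The sketch below tests shattering by counting the distinct bit patterns that the rows show on a set of columns. Columns are numbered from 0, so $\{x_2,x_3,x_4\}$ is `(1, 2, 3)`; the names `ROWS`, `is_shattered`, and `vc_dimension` are assumptions of the illustration.

```python
from itertools import combinations

ROWS = [
    (0, 0, 1, 0, 0, 1, 1),
    (1, 0, 0, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0),
    (1, 1, 0, 0, 1, 1, 0),
    (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1),
    (0, 0, 0, 1, 1, 0, 0),
    (0, 1, 1, 1, 0, 1, 1),
]

def is_shattered(rows, attrs):
    """A set of columns is shattered when all 2^|attrs| bit patterns occur on it."""
    patterns = {tuple(r[a] for a in attrs) for r in rows}
    return len(patterns) == 2 ** len(attrs)

def vc_dimension(rows):
    """Largest size of a shattered set of columns (exhaustive search)."""
    m = len(rows[0])
    best = 0
    for size in range(1, m + 1):
        if any(is_shattered(rows, c) for c in combinations(range(m), size)):
            best = size
        else:
            break   # subsets of shattered sets are shattered, so larger sizes fail too
    return best

dimension = vc_dimension(ROWS)
```

The exhaustive search is exponential in the number of attributes, so it is meant only for tables of the size used in the examples.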


Suppose that all tuples of a $(\mathbf{2},\mathbf{n})$-table $T$ are distinct. This allows us
to define a multi-valued partial injective function $f_T : \mathbf{2}^m \longrightarrow \{0,\ldots,n-1\}$
with binary inputs and multivalued output. The set of all such partial
functions is denoted by $\mathrm{PF}(\mathbf{2}^m,\mathbf{n})$.


The registered vectors of an index table $T$ whose heading is $\{x_1,\ldots,x_m,y\}$
can be regarded as the characteristic vectors of certain subsets of the set
$X = \{x_1,\ldots,x_m\}$ as shown next. Namely, if $t = (a_1,\ldots,a_m,b)$ is a tuple,
its corresponding subset is $C_t = \{x_i \mid a_i = 1 \text{ for } 1 \leq i \leq m\}$.

If a subset $U$ of the heading $X$ with $|U| = k$ is shattered by $\mathcal{C}_T$, then
$T[U]$ contains all binary equivalents of the numbers between 0 and $2^k - 1$ (and
some values may be repeated).


Definition 3.6 Let $f : \mathbf{2}^m \longrightarrow \mathbf{n}$ be an index function described by the
index table $T_f = (X,R)$, where $X = x_1 \cdots x_m y$. A set of variables $V =
\{x_{i_1},\ldots,x_{i_p}\}$ is a determining set for $f$ if $|\mathcal{C}_V| = |\mathcal{C}_{T_f}|$.


In other words, if $V = \{x_{i_1},\ldots,x_{i_p}\}$ is a determining set for the index
function $f$, then the projection $T_f[x_{i_1} \cdots x_{i_p} y]$ is also an index table.
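One way to read the definition operationally, offered here as an interpretation rather than as the authors' procedure, is that a set of attributes is determining exactly when the projections of the registered vectors on it remain pairwise distinct. The sketch below applies this test to Table 1; columns are numbered from 0 and the names are illustrative.

```python
ROWS = [
    (0, 0, 1, 0, 0, 1, 1),
    (1, 0, 0, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0),
    (1, 1, 0, 0, 1, 1, 0),
    (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1),
    (0, 0, 0, 1, 1, 0, 0),
    (0, 1, 1, 1, 0, 1, 1),
]

def is_determining(rows, attrs):
    """True when no two rows share the same projection on attrs."""
    projections = [tuple(r[a] for a in attrs) for r in rows]
    return len(set(projections)) == len(rows)

ok = is_determining(ROWS, (0, 1, 2, 3))        # x1 x2 x3 x4 separates all nine rows
too_small = is_determining(ROWS, (0, 1))       # x1 x2 has only four distinct projections
```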


Theorem 3.7 A minimal determining set for an index function $f$ contains
a maximal set of attributes that is shattered by $\mathcal{C}_{T_f}$.


Proof: Let $V$ be a minimal determining set for the index function $f$.
Note that $V$ does not contain an attribute $x$ that has a constant value (1
or 0) in $T_f$ for, otherwise, we would be able to drop $x$ and the set $V - \{x\}$
would still be a determining set.

Thus, if $x \in V$, both 0 and 1 are present under $x$ and the set $\{x\}$ is
shattered by $\mathcal{C}_{T_f}$. This shows that $V$ contains sets that are shattered by $\mathcal{C}_{T_f}$.
Since $\mathcal{P}(V)$ is finite, it is immediate that there are maximal subsets of $V$
that are shattered by $\mathcal{C}_{T_f}$.

Observe that the projection of the table $T_f$ on a set of attributes $W$ need not contain
distinct values, so $\mathrm{gini}(\pi_W) \leq 1 - \frac{1}{n}$, where $n = w(T_f)$. By systematically expanding a set $W$
that is shattered, that is, by adding to $W$ subsets $L$ of $X - W$, it is possible
to reach a determining set $V = WL$. Thus, the maximum size of a set shattered
by $\mathcal{C}_{T_f}$ offers a lower bound for the size of determining sets and allows
avoiding a search of the entire collection of subsets of $X$.


These considerations suggest Algorithm 3.1 for identifying determining sets.


Algorithm 3.1: Identification of Determining Sets
Input: Index table $T = (X,R)$
Output: Collection of determining sets $DS(f)$ for the index function represented by $T$
begin
    set $m = |X|$, $n = |R|$;
    set $DS(f) \leftarrow \emptyset$, $d_1 \leftarrow 0$, $d_2 \leftarrow 0$;
    while not ($|\mathcal{C}| \leq \sum_{i=0}^{d_1} \binom{m}{i}$) do
        $d_1$++;
    end
    while not ($|\mathcal{C}| < 2^{d_2}$) do
        $d_2$++;
    end
    foreach $W \in \mathcal{P}(X)$ with $d_1 \leq |W| \leq d_2$ and maximal Gini index do
        if $W$ is shattered by $\mathcal{C}$ then
            foreach $L \in \mathcal{P}(X - W)$ do
                if $\mathrm{gini}(\pi_{W \cup L}) \geq 1 - \frac{1}{n}$ then
                    add $W \cup L$ to $DS(f)$
                end
            end
        end
    end
end


Example 3.8 In Example 3.5 we have shown that the Vapnik-Chervonenkis
dimension of the collection of sets introduced in Example 3.2 is 3. Therefore,
any determining set for the index function specified must include at least 3
variables. The computation of the Gini index for three-variable subsets shown
below indicates that none of these sets has the Gini index of $0.8889 = 1 - \frac{1}{9}$,
but there are several such sets (marked with an asterisk) that attain the
maximum value of 0.8642.


x1x2x3 : 0.8148    x1x2x4 : 0.8642*   x1x2x5 : 0.8642*   x1x2x6 : 0.8148
x1x2x7 : 0.8148    x1x3x4 : 0.8148    x1x3x5 : 0.8148    x1x3x6 : 0.7901
x1x3x7 : 0.7901    x1x4x5 : 0.8148    x1x4x6 : 0.8642*   x1x4x7 : 0.8148
x1x5x6 : 0.8395    x1x5x7 : 0.8148    x1x6x7 : 0.8395    x2x3x4 : 0.8642*
x2x3x5 : 0.8395    x2x3x6 : 0.8148    x2x3x7 : 0.8395    x2x4x5 : 0.8148
x2x4x6 : 0.8395    x2x4x7 : 0.8148    x2x5x6 : 0.8395    x2x5x7 : 0.8148
x2x6x7 : 0.8395    x3x4x5 : 0.8395    x3x4x6 : 0.8642*   x3x4x7 : 0.8395
x3x5x6 : 0.8395    x3x5x7 : 0.7901    x3x6x7 : 0.8395    x4x5x6 : 0.8395
x4x5x7 : 0.7654    x4x6x7 : 0.8395    x5x6x7 : 0.7654


These sets can be extended to determining sets. In the next table, the
extensions of the sets of size 3 that are determining sets (Gini index 0.8889)
are marked with an asterisk:

x1x2x3x4 : 0.8889*   x1x2x3x5 : 0.8889*   x1x2x3x6 : 0.8395    x1x2x3x7 : 0.8642
x1x2x4x5 : 0.8642    x1x2x4x6 : 0.8889*   x1x2x4x7 : 0.8642    x1x2x5x6 : 0.8889*
x1x2x5x7 : 0.8642    x1x2x6x7 : 0.8642    x1x3x4x5 : 0.8642    x1x3x4x6 : 0.8642
x1x3x4x7 : 0.8642    x1x3x5x6 : 0.8642    x1x3x5x7 : 0.8642    x1x3x6x7 : 0.8642
x1x4x5x6 : 0.8889*   x1x4x5x7 : 0.8395    x1x4x6x7 : 0.8889*   x1x5x6x7 : 0.8642
x2x3x4x5 : 0.8889*   x2x3x4x6 : 0.8889*   x2x3x4x7 : 0.8889*   x2x3x5x6 : 0.8642
x2x3x5x7 : 0.8642    x2x3x6x7 : 0.8642    x2x4x5x6 : 0.8642    x2x4x5x7 : 0.8395
x2x4x6x7 : 0.8642    x2x5x6x7 : 0.8642    x3x4x5x6 : 0.8889*   x3x4x5x7 : 0.8642
x3x4x6x7 : 0.8889*   x3x5x6x7 : 0.8642    x4x5x6x7 : 0.8395
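The search described by Algorithm 3.1 can be sketched in Python as follows. This is one possible reading, not the authors' implementation: "maximal Gini index" is interpreted as maximal among the candidate sets of the same size, the lower bound $d_1$ is taken as the largest value allowed by Theorem 3.3, and columns are numbered from 0. All function and variable names are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

ROWS = [
    (0, 0, 1, 0, 0, 1, 1), (1, 0, 0, 0, 1, 0, 1), (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0), (1, 1, 0, 0, 1, 1, 0), (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1), (0, 0, 0, 1, 1, 0, 0), (0, 1, 1, 1, 0, 1, 1),
]

def project(rows, attrs):
    return [tuple(r[a] for a in attrs) for r in rows]

def gini(rows, attrs):
    """Gini index of the partition of the rows induced by the columns in attrs."""
    counts = {}
    for p in project(rows, attrs):
        counts[p] = counts.get(p, 0) + 1
    n = len(rows)
    return 1 - Fraction(sum(c * c for c in counts.values()), n * n)

def is_shattered(rows, attrs):
    return len(set(project(rows, attrs))) == 2 ** len(attrs)

def determining_sets(rows):
    m, n = len(rows[0]), len(rows)
    d1 = 0
    while n > sum(comb(m, i) for i in range(d1 + 1)):   # stops with phi(m, d1-1) < n
        d1 += 1
    d2 = 0
    while n >= 2 ** d2:                                  # stops with n < 2**d2
        d2 += 1
    found = set()
    for size in range(d1, d2 + 1):        # sets of size d2 can never be shattered
        candidates = list(combinations(range(m), size))
        if not candidates:
            continue
        top = max(gini(rows, w) for w in candidates)
        for w in candidates:
            if gini(rows, w) != top or not is_shattered(rows, w):
                continue
            rest = [a for a in range(m) if a not in w]
            for extra in range(len(rest) + 1):
                for l in combinations(rest, extra):
                    v = tuple(sorted(w + l))
                    if gini(rows, v) == 1 - Fraction(1, n):
                        found.add(v)
    return d1, d2, found

d1, d2, found = determining_sets(ROWS)
```

The returned collection contains every extension that separates all rows, so it includes non-minimal sets as well; filtering for minimality would be a separate step.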


4 The Gini-based metric and data compression 


A set of attributes $U$ of a table $T = (X,R)$ generates a partition $\pi_U$ of the
set of rows $R$ whose blocks consist of tuples that have the same projection
on $U$.


Example 4.1 For the table introduced in Example 3.1 the partition of $R$
induced by $x_1x_2$ is $\pi_{x_1x_2} = \{\{t_0,t_3,t_7\},\{t_1,t_2\},\{t_4,t_6\},\{t_5,t_8\}\}$.
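The partition of this example can be reproduced by grouping row positions by their projection. In the sketch below rows are numbered from 0, matching $t_0,\ldots,t_8$, and columns are numbered from 0, so $x_1x_2$ is `(0, 1)`; the function name `induced_partition` is an assumption of the illustration.

```python
ROWS = [
    (0, 0, 1, 0, 0, 1, 1),
    (1, 0, 0, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0),
    (1, 1, 0, 0, 1, 1, 0),
    (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1),
    (0, 0, 0, 1, 1, 0, 0),
    (0, 1, 1, 1, 0, 1, 1),
]

def induced_partition(rows, attrs):
    """Group row positions that share the same projection on attrs."""
    blocks = {}
    for position, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, []).append(position)
    return sorted(blocks.values())

partition = induced_partition(ROWS, (0, 1))
```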


This allows us to identify sets of attributes whose partitions are close in the
sense of the distance $\delta$. Formula (1) from Example 2.1 suggests that when
$\delta(\pi_x,\pi_{x'})$ is small the columns corresponding to $x$ and $x'$ are rather similar.
This allows encoding the values that occur in the projection $T[xx']$ using a single
value that belongs to a higher radix.

A hierarchical clustering algorithm produces a hierarchical system of 
clusters (also known as a dendrogram) as a tree. Cutting this tree at a 
certain height generates a clustering that groups together attributes that 
may be encoded together. 


Example 4.2 For the table given in Example 3.1 the mutual distances
between attributes are given in the following table:


   |   x1   |   x2   |   x3   |   x4   |   x5   |   x6   |   x7
x1 | 0.0000 | 0.4938 | 0.3457 | 0.4938 | 0.4938 | 0.4938 | 0.4938
x2 | 0.4938 | 0.0000 | 0.4938 | 0.4938 | 0.4938 | 0.4444 | 0.4938
x3 | 0.3457 | 0.4938 | 0.0000 | 0.4938 | 0.4444 | 0.4938 | 0.4444
x4 | 0.4938 | 0.4938 | 0.4938 | 0.0000 | 0.4444 | 0.4938 | 0.4938
x5 | 0.4938 | 0.4938 | 0.4444 | 0.4444 | 0.0000 | 0.4444 | 0.3457
x6 | 0.4938 | 0.4444 | 0.4938 | 0.4938 | 0.4444 | 0.0000 | 0.4938
x7 | 0.4938 | 0.4938 | 0.4444 | 0.4938 | 0.3457 | 0.4938 | 0.0000
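The entries of this table can be recomputed from Table 1 with a short script. The sketch below evaluates the distance between the partitions induced by two columns using exact fractions; the names `sum_sq`, `distance`, and `matrix` are assumptions of the illustration, and columns are numbered from 0.

```python
from fractions import Fraction

ROWS = [
    (0, 0, 1, 0, 0, 1, 1), (1, 0, 0, 0, 1, 0, 1), (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0), (1, 1, 0, 0, 1, 1, 0), (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1), (0, 0, 0, 1, 1, 0, 0), (0, 1, 1, 1, 0, 1, 1),
]

def sum_sq(keys):
    """Sum of squared block sizes of the partition that groups equal keys."""
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return sum(c * c for c in counts.values())

def distance(rows, a, b):
    """Distance between the partitions induced by columns a and b."""
    col_a = [r[a] for r in rows]
    col_b = [r[b] for r in rows]
    both = list(zip(col_a, col_b))          # keys of the meet of the two partitions
    return Fraction(sum_sq(col_a) + sum_sq(col_b) - 2 * sum_sq(both), len(rows) ** 2)

m = len(ROWS[0])
matrix = [[distance(ROWS, a, b) for b in range(m)] for a in range(m)]
rounded = [[round(float(v), 4) for v in row] for row in matrix]
```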
Applying single-link hierarchical clustering produces the dendrogram
shown in Figure 1.


[Figure 1: Dendrogram of Single-Link Clustering. Leaves, from left to right: x5, x7, x1, x3, x4, x2, x6. Clusterings obtained at the cuts marked in the figure: {{x5, x7, x1, x3}, {x4}, {x2, x6}}; {{x5, x7, x1, x3, x4}, {x2, x6}}; {{x5, x7, x1, x3, x4, x2, x6}}.]


Algorithm 4.1 takes a dataset and a cutting height as input and
produces a compressed dataset together with a mapping file.

The algorithm creates a matrix of Gini-based distances between the
attributes of the dataset. The distance matrix is used to run a single-linkage
clustering algorithm. The resulting hierarchy is cut at the provided cutting height, and
the values of the attributes in each cluster are encoded into a new column with
the name of the cluster.

This joining is based on the assumption that there is a small number of
unique projections of dataset transactions on the set of attributes in the same
cluster. The encoding is done by running a group-by query on the dataset projected
on the set of attributes and enumerating the results of the query. This
enumeration and a mapping of sets of attributes to the respective cluster names
are saved in the mapping file of the output. The new clustered columns and
the columns for unclustered attributes form the compressed dataset.
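The steps described above can be sketched in plain Python. The sketch makes several assumptions that are not taken from the paper: the cutting height is treated as a threshold on the distance itself, the clusters at that height are obtained as the connected components of the graph that joins attributes at distance at most `r` (which is what cutting a single-link hierarchy at `r` yields), the group-by query is replaced by a dictionary of distinct projections, and the mapping is returned in memory instead of being written to a file. The data are the rows of Table 1.

```python
from fractions import Fraction

ROWS = [
    (0, 0, 1, 0, 0, 1, 1), (1, 0, 0, 0, 1, 0, 1), (1, 0, 1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1, 1, 0), (1, 1, 0, 0, 1, 1, 0), (0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0, 1, 1), (0, 0, 0, 1, 1, 0, 0), (0, 1, 1, 1, 0, 1, 1),
]

def sum_sq(keys):
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return sum(c * c for c in counts.values())

def distance(rows, a, b):
    col_a = [r[a] for r in rows]
    col_b = [r[b] for r in rows]
    both = list(zip(col_a, col_b))
    return Fraction(sum_sq(col_a) + sum_sq(col_b) - 2 * sum_sq(both), len(rows) ** 2)

def clusters_at(rows, r):
    """Single-link clusters at height r: components of the 'distance <= r' graph."""
    m = len(rows[0])
    root = list(range(m))
    def find(i):
        while root[i] != i:
            i = root[i]
        return i
    for a in range(m):
        for b in range(a + 1, m):
            if distance(rows, a, b) <= r:
                root[find(a)] = find(b)
    groups = {}
    for a in range(m):
        groups.setdefault(find(a), []).append(a)
    return sorted(groups.values())

def compress(rows, clusters):
    """Replace each multi-attribute cluster by one column of enumeration codes."""
    mapping = {}
    out = [[] for _ in rows]
    for cluster in clusters:
        codes = {}
        for new_row, row in zip(out, rows):
            key = tuple(row[a] for a in cluster)
            if len(cluster) == 1:
                new_row.append(key[0])                  # unclustered attribute, kept as is
            else:
                new_row.append(codes.setdefault(key, len(codes)))
        if len(cluster) > 1:
            mapping[tuple(cluster)] = codes
    return [tuple(r) for r in out], mapping

clusters = clusters_at(ROWS, Fraction(28, 81))
compressed, mapping = compress(ROWS, clusters)
```

With this threshold the pairs $x_1x_3$ and $x_5x_7$ are merged, so each row shrinks from seven binary values to five values, two of which belong to a higher radix.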

Algorithm 4.1: Compression Algorithm
Input: Dataset $T = (X,R)$ and cutting height $r$
Output: Compressed dataset $T'$, mapping file
begin
    let distanceMatrix[|X|][|X|] be a matrix of distances between attributes;
    foreach pair of attributes $(x_k,x_\ell)$ do
        calculate distance $d(x_k,x_\ell) = 2\,\mathrm{gini}(\pi_{x_k} \wedge \pi_{x_\ell}) - \mathrm{gini}(\pi_{x_k}) - \mathrm{gini}(\pi_{x_\ell})$;
        set distanceMatrix[k][ℓ] = d;
        set distanceMatrix[ℓ][k] = d;
    end
    run the single-linkage clustering algorithm for the set of attributes $X$ with the distance matrix distanceMatrix[|X|][|X|];
    foreach cluster with height < r that is not strictly included in any other cluster with height < r do
        join all the items from the cluster into a new column named as the cluster;
        run a group-by query on the dataset $T$ for the attributes in the cluster;
        enumerate the rows in the query result;
        save the new column in the output dataset $T'$, where the value for each row is based on the enumeration from the query;
        save all the mappings to the mapping file;
    end
    save all the columns whose attributes are not included in the clusters into the output dataset $T'$ without change;
end

Our experiments involved the mushroom data set [12] in a binarized
form. For every original column we created as many binary columns as
the number of unique values in the column. For instance, if column $c_i$
has three values $a,b,c$, then three binary columns $c_i\_a$, $c_i\_b$,
and $c_i\_c$ are created. Whenever $c_i$ has value $a$, the column $c_i\_a$ has value 1, and 0
otherwise, etc. The resulting dataset had 126 attributes, and after removing
any column that had either all zeros or all ones the new dataset had 116
attributes left. The resulting file had a size of 1,851 KB.
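The binarization step can be illustrated with a small sketch. The records below are toy data invented for the illustration, not the mushroom data set, and the function name `binarize` is hypothetical.

```python
def binarize(records, columns):
    """One indicator column per (column, value) pair; constant indicator columns are dropped."""
    pairs = [(c, v) for c in columns for v in sorted({rec[c] for rec in records})]
    rows = [[int(rec[c] == v) for c, v in pairs] for rec in records]
    # Keep only the indicator columns that take both values 0 and 1.
    keep = [j for j in range(len(pairs)) if len({row[j] for row in rows}) > 1]
    names = [f"{pairs[j][0]}_{pairs[j][1]}" for j in keep]
    return names, [[row[j] for j in keep] for row in rows]

records = [
    {"c1": "a", "c2": "u"},
    {"c1": "b", "c2": "u"},
    {"c1": "c", "c2": "u"},
]
names, rows = binarize(records, ["c1", "c2"])
```

In this toy input the column `c2` has a single value, so its only indicator column is constant and is removed, leaving the three indicators of `c1`.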


The original dataset had 22 attributes and 1 attribute for the class 
(poisonous/edible) and 8124 transactions. The class column was excluded. 


We ran Algorithm 4.1 for several different cutting height values r. The 
dependency of sizes of the compressed files on the cutting height value r is 
shown in Figure 2. It can be readily seen that the best compression happens 
when the cutting height value is about 0.35. 


[Figure 2: Compressed File Sizes vs Cutting Height]


5 Conclusions


The Gini index, which was developed for statistical purposes, is a member of
a broader family of diversity measures known as generalized entropies. We
applied this index in conjunction with the Vapnik-Chervonenkis dimension of
collections of sets to develop an algorithm that seeks to identify determining
sets for index functions and provides a lower limit to the size of such sets.
The relationship between determining sets and the Vapnik-Chervonenkis
dimension of the collection of sets defined by an index function suggests
that this dimension is a good proxy for the complexity of index functions, a
further research goal to be explored.

The metric space generated by the Gini index on the set of partitions 
was used to develop a data compression algorithm starting from a clustering 
algorithm applied to table attributes. This compression is achieved by 
grouping together attributes that have similar value distributions. 

It would be interesting to examine the use of other types of entropies 
(e.g. Shannon’s entropy) for solving these problems. 